[SOUND] This lecture is a continued
discussion of probabilistic topic models.
In this lecture, we're going to continue
discussing probabilistic models.
We're going to talk about
a very simple case where we
are interested in just mining
one topic from one document.
So in this simple setup,
we are interested in analyzing
one document and
trying to discover just one topic.
So this is the simplest
case of topic model.
The input now no longer has k,
which is the number of topics because we
know there is only one topic and the
collection has only one document, also.
In the output,
we also no longer have coverage because
we assumed that the document
covers this topic 100%.
So the main goal is just to discover
the world of probabilities for
this single topic, as shown here.
As always, when we think about using a
generating model to solve such a problem,
we start with thinking about what
kind of data we are going to model or
from what perspective we're going to
model the data or data representation.
And then we're going to
design a specific model for
the generating of the data,
from our perspective.
Where our perspective just means we want
to take a particular angle of looking at
the data, so that the model will
have the right parameters for
discovering the knowledge that we want.
And then we'll be thinking
about the microfunction or
write down the microfunction to
capture more formally how likely
a data point will be
obtained from this model.
And the likelihood function will have
some parameters in the function.
And then we argue our interest in
estimating those parameters for example,
by maximizing the likelihood which will
lead to maximum likelihood estimated.
These estimator parameters
will then become the output
of the mining hours,
which means we'll take the estimating
parameters as the knowledge
that we discover from the text.
So let's look at these steps for
this very simple case.
Later we'll look at this procedure for
some more complicated cases.
So our data, in this case is, just
a document which is a sequence of words.
Each word here is denoted by x sub i.
Our model is a Unigram language model.
A word distribution that we hope to
denote a topic and that's our goal.
So we will have as many parameters as many
words in our vocabulary, in this case M.
And for convenience we're
going to use theta sub i to
denote the probability of word w sub i.
And obviously these theta
sub i's will sum to 1.
Now what does a likelihood
function look like?
Well, this is just the probability
of generating this whole document,
that given such a model.
Because we assume the independence in
generating each word so the probability of
the document will be just a product
of the probability of each word.
And since some word might
have repeated occurrences.
So we can also rewrite this
product in a different form.
So in this line, we have rewritten
the formula into a product
over all the unique words in
the vocabulary, w sub 1 through w sub M.
Now this is different
from the previous line.
Well, the product is over different
positions of words in the document.
Now when we do this transformation,
we then would need to
introduce a counter function here.
This denotes the count of
word one in document and
similarly this is the count
of words of n in the document
because these words might
have repeated occurrences.
You can also see if a word did
not occur in the document.
It will have a zero count, therefore
that corresponding term will disappear.
So this is a very useful form of
writing down the likelihood function
that we will often use later.
So I want you to pay attention to this,
just get familiar with this notation.
It's just to change the product over all
the different words in the vocabulary.
So in the end, of course, we'll use
theta sub i to express this likelihood
function and it would look like this.
Next, we're going to find
the theta values or probabilities
of these words that would maximize
this likelihood function.
So now lets take a look at the maximum
likelihood estimate problem more closely.
This line is copied from
the previous slide.
It's just our likelihood function.
So our goal is to maximize
this likelihood function.
We will find it often easy to
maximize the local likelihood
instead of the original likelihood.
And this is purely for
mathematical convenience because after
the logarithm transformation our function
will becomes a sum instead of product.
And we also have constraints
over these these probabilities.
The sum makes it easier to take
derivative, which is often needed for
finding the optimal
solution of this function.
So please take a look at this sum again,
here.
And this is a form of
a function that you will often
see later also,
the more general topic models.
So it's a sum over all
the words in the vocabulary.
And inside the sum there is
a count of a word in the document.
And this is macroed by
the logarithm of a probability.
So let's see how we can
solve this problem.
Now at this point the problem is purely a
mathematical problem because we are going
to just the find the optimal solution
of a constrained maximization problem.
The objective function is
the likelihood function and
the constraint is that all these
probabilities must sum to one.
So, one way to solve the problem is
to use Lagrange multiplier approace.
Now this command is beyond
the scope of this course but
since Lagrange multiplier is a very
useful approach, I also would like
to just give a brief introduction to this,
for those of you who are interested.
So in this approach we will
construct a Lagrange function, here.
And this function will combine
our objective function
with another term that
encodes our constraint and
we introduce Lagrange multiplier here,
lambda, so it's an additional parameter.
Now, the idea of this approach is just to
turn the constraint optimization into,
in some sense,
an unconstrained optimizing problem.
Now we are just interested in
optimizing this Lagrange function.
As you may recall from calculus,
an optimal point
would be achieved when
the derivative is set to zero.
This is a necessary condition.
It's not sufficient, though.
So if we do that you will
see the partial derivative,
with respect to theta i
here ,is equal to this.
And this part comes from the derivative
of the logarithm function and
this lambda is simply taken from here.
And when we set it to zero we can
easily see theta sub i is
related to lambda in this way.
Since we know all the theta
i's must a sum to one
we can plug this into this constraint,
here.
And this will allow us to solve for
lambda.
And this is just a net
sum of all the counts.
And this further allows us to then
solve the optimization problem,
eventually, to find the optimal
setting for theta sub i.
And if you look at this formula it turns
out that it's actually very intuitive
because this is just the normalized
count of these words by the document ns,
which is also a sum of all
the counts of words in the document.
So, after all this mess, after all,
we have just obtained something
that's very intuitive and
this will be just our
intuition where we want to
maximize the data by
assigning as much probability
mass as possible to all
the observed the words here.
And you might also notice that this is
the general result of maximum likelihood
raised estimator.
In general, the estimator would be to
normalize counts and it's just sometimes
the counts have to be done in a particular
way, as you will also see later.
So this is basically an analytical
solution to our optimization problem.
In general though, when the likelihood
function is very complicated, we're not
going to be able to solve the optimization
problem by having a closed form formula.
Instead we have to use some
numerical algorithms and
we're going to see such cases later, also.
So if you imagine what would we
get if we use such a maximum
likelihood estimator to estimate one
topic for a single document d here?
Let's imagine this document
is a text mining paper.
Now, what you might see is
something that looks like this.
On the top, you will see the high
probability words tend to be those very
common words,
often functional words in English.
And this will be followed by
some content words that really
characterize the topic well like text,
mining, etc.
And then in the end,
you also see there is more probability of
words that are not really
related to the topic but
they might be extraneously
mentioned in the document.
As a topic representation,
you will see this is not ideal, right?
That because the high probability
words are functional words,
they are not really
characterizing the topic.
So my question is how can we
get rid of such common words?
Now this is the topic of the next module.
We're going to talk about how to use
probabilistic models to somehow get rid of
these common words.
[MUSIC]

